Abstract

In a linear regression model with fixed dimension, we construct confidence sets for the unknown 
parameter vector based on the Lasso estimator in finite samples as well as in an asymptotic setup, 
thereby quantifying estimation uncertainty of this estimator. In finite samples with Gaussian errors 
and asymptotically in the case where the Lasso estimator is tuned to perform conservative model 
selection, we derive formulas for computing the minimal coverage probability over the entire parameter
space for a large class of shapes for the confidence sets, thus enabling the construction of valid
confidence sets based on the Lasso estimator in these settings. The choice of shape for the confidence 
sets and comparison with the confidence ellipse based on the least-squares estimator is also discussed. 
Moreover, in the case where the Lasso estimator is tuned to enable consistent model selection, we 
give a simple confidence set with minimal coverage probability converging to one. 


1 Introduction 


The Lasso estimator as introduced in Tibshirani (1996), as well as many variants thereof, has gained strong interest in the statistics community and in applied areas over the past two decades. As is well known, the main attraction of the Lasso estimator lies in its ability to perform model selection and parameter estimation at very low computational cost, see for instance Alliney & Ruzinsky (1994), Efron et al. (2004) and Rosset & Zhu (2007), and in the fact that the estimator can be used in high-dimensional settings where the number of variables $p$ exceeds the number of observations $n$ ("$p \gg n$").

The literature on distributional properties of the Lasso estimator in the low-dimensional setting ($p < n$) includes the often-cited paper by Knight & Fu (2000), who derive the asymptotic distribution when the estimator is tuned to perform conservative model selection. Pötscher & Leeb (2009) give a detailed analysis in the framework of a linear regression model with orthogonal design and derive the distribution of the Lasso estimator in finite samples as well as in the two asymptotic regimes of consistent and conservative tuning. Implications of these results for confidence intervals are analyzed in Pötscher & Schneider (2010), and generalizations to a moderate-dimensional setting where $p < n$ but $p$ diverges with $n$ are contained in Pötscher & Schneider (2011) and Schneider (2015).


In a high-dimensional setting with $p \gg n$, confidence regions and confidence intervals in connection with the Lasso estimator have recently been treated in a number of papers including Zhang & Zhang (2014), Van de Geer et al. (2014), Javanmard & Montanari (2014), Caner & Kock (2014) and Van de Geer (2015). All these papers use the idea of "de-sparsifying" the Lasso estimator, which in the case of $p < n$ essentially reduces to using the least-squares (LS) estimator for inference. In that sense this theory leaves a gap on how to construct confidence regions based on the Lasso estimator in a low-dimensional framework to provide uncertainty quantification for the Lasso estimator also in this case.


Lee et al. (2013) consider finite-sample results for confidence intervals in connection with the Lasso estimator, yet these authors take a different route in that their intervals are not set to cover the true parameter, but a pseudo-true value that depends on the selected model and coincides with the true parameter if the selected model is correct. All inference is conditional on the selected model. Their method is in line with the general proposal of Berk et al. (2013), who discuss an intricate procedure for obtaining confidence regions for this pseudo-true parameter after a model selection step.

In this paper, we construct confidence sets based on the Lasso estimator for the entire unknown parameter vector. One of the challenges of this task lies in the well-known fact that the finite-sample distribution of the Lasso estimator depends on the unknown parameter in a complicated manner. This phenomenon does not vanish for large samples, as can be seen within a so-called moving-parameter asymptotic framework (see Pötscher & Leeb (2009) for a detailed analysis in orthogonal design), and also occurs for related estimators. In order to construct valid confidence sets, we need to know the smallest coverage probability occurring over the whole parameter space. Pötscher & Schneider (2010) derive a formula for the minimal coverage probability of fixed-width confidence intervals based on the Lasso estimator in one dimension using knowledge of its finite-sample distribution. In the general case, this finite-sample distribution is not known, so it is not clear how to obtain an expression for the coverage probability in more than one dimension. Additionally, this coverage probability clearly depends on the shape that is used for the confidence set and it is not clear a priori what this shape should be. We do the following.

While the finite-sample distribution and therefore the coverage probability for any kind of set based on the Lasso estimator is unknown in general dimensions, we show that computing the minimal coverage probability can actually be carried out without this explicit knowledge. We obtain an explicit formula for the minimal coverage probability by, in a way, deferring the minimization problem into the objective function that defines the estimator, as is depicted in Section 3. For the confidence regions, we consider a large class of shapes that is determined by a condition involving the regressor matrix. This class encompasses the elliptic shape one would use if the confidence region was based on the LS estimator, thus enabling comparisons with the LS confidence ellipse. Analogously to the fixed-width intervals in Pötscher & Schneider (2010), the confidence regions we consider are random only through their centering at the Lasso estimator (which is also in line with the setup in the literature for high-dimensional settings, see for instance Van de Geer et al. (2014)). Asymptotically, we distinguish between two regimes for the tuning parameters which we call conservative and consistent tuning. As suggested by the results in Pötscher & Schneider (2010), our results from finite samples essentially carry over asymptotically when the estimator is tuned conservatively. In the case of consistent tuning, the uniform convergence rate of the estimator is slower than $n^{-1/2}$ and we give the asymptotic distribution of the Lasso estimator when scaled by the appropriate factor corresponding to the uniform convergence rate, as well as suggesting a simple construction for a confidence set in that case.

The remainder of the paper is organized as follows. In Section 2 we set the framework by stating the model, defining the estimator and introducing some notation. The main result giving the formula for the minimal coverage probability is presented in Section 3, and subsequently Section 4 is devoted to discussing how to concretely construct the corresponding confidence sets, as well as their relationship to the confidence ellipse based on the LS estimator. In Section 5 we derive asymptotic results both for the case of conservative and the case of consistent model selection. Section 6 concludes. All proofs are deferred to Section 7.


2 Setting and Assumptions 

Consider the linear model

$$y = X\beta + \varepsilon,$$

where $y$ is the observed $n \times 1$ data vector, $X$ the $n \times p$ regressor matrix, which is assumed to be non-stochastic with full column rank $p$, $\beta \in \mathbb{R}^p$ is the true parameter vector and $\varepsilon$ the unobserved error term defined on some probability space $(\Omega, \mathcal{A}, P)$ and consisting of independent and identically distributed components with mean 0 and finite variance $\sigma^2$. We consider a componentwise tuned Lasso estimator $\hat\beta_L$, defined as the unique solution to the minimization problem

$$\min_{\beta \in \mathbb{R}^p} L_n(\beta) = \min_{\beta \in \mathbb{R}^p} \|y - X\beta\|^2 + 2 \sum_{j=1}^p \lambda_{n,j} |\beta_j|,$$

where the $\lambda_{n,j}$ are non-negative and non-random componentwise tuning parameters, which make it possible to exclude parameters from penalization. Note that if $\lambda_{n,j} = 0$ for all $j$, this estimator is equal to the ordinary least-squares (LS) estimator $\hat\beta_{LS}$, and that $\lambda_{n,j} = c > 0$ for all $j$ corresponds to the "classical" Lasso estimator as proposed by Tibshirani (1996). For later use, let $\lambda_n = (\lambda_{n,1}, \dots, \lambda_{n,p})'$ and $\Lambda_n = \mathrm{diag}(\lambda_n)$, the diagonal matrix whose diagonal elements are given by the components of $\lambda_n$.

We use $1$ for the indicator function and make the following obvious definitions. For $a \in \mathbb{R}^p$ and $B \subseteq \mathbb{R}^p$, the set $a + B = B + a \subseteq \mathbb{R}^p$ is defined as the set $\{a + b : b \in B\}$. For a $p \times p$ matrix $C$ and a scalar $c$, the sets $CB$ and $cB$ are $\{Cb : b \in B\} \subseteq \mathbb{R}^p$ and $\{cb : b \in B\} \subseteq \mathbb{R}^p$, respectively. Finally, for $k \in \mathbb{N}$, $I_k$ stands for the $k \times k$ identity matrix and $\bar{\mathbb{R}}$ denotes the extended real line $\mathbb{R} \cup \{-\infty, \infty\}$.
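To make the objective concrete, the following is a minimal sketch of a coordinate-descent solver for the componentwise tuned problem $\|y - X\beta\|^2 + 2\sum_j \lambda_{n,j}|\beta_j|$. It is not the routine used for the computations in this paper; the function names, the toy data and the number of sweeps are illustrative assumptions. Each coordinate update is a soft-thresholding step at level $\lambda_{n,j}$, which follows from the scaling of the objective above.

```python
import random


def soft_threshold(z, lam):
    """Return sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_componentwise(X, y, lam, sweeps=500):
    """Minimize ||y - X b||^2 + 2 * sum_j lam[j] * |b[j]| by coordinate descent.

    Hypothetical helper for illustration; X is a list of rows, lam a list of
    non-negative tuning parameters, one per coefficient.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Inner product of column j with the residual that leaves out coordinate j.
            rho = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            b[j] = soft_threshold(rho, lam[j]) / col_sq[j]
    return b


# Toy data (assumed values): n = 20, p = 2, beta = (1, 0)', standard normal errors.
rng = random.Random(0)
n, p = 20, 2
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
beta = [1.0, 0.0]
y = [sum(X[i][j] * beta[j] for j in range(p)) + rng.gauss(0, 1) for i in range(n)]

b_ls = lasso_componentwise(X, y, [0.0] * p)   # all tuning parameters zero: LS estimator
lam = [n ** 0.5 / 2] * p
b_lasso = lasso_componentwise(X, y, lam)
```

With all tuning parameters set to zero the routine returns the LS estimator, which gives a simple way to compare the two estimators on the same data.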


3 Finite Sample Results 


We aim to construct confidence sets for the entire parameter vector $\beta$ based on the Lasso estimator $\hat\beta_L$. That means that for a non-random set $M \subseteq \mathbb{R}^p$, we consider sets of the form

$$\hat\beta_L - M = \{\hat\beta_L - m : m \in M\},$$

which have to satisfy that the probability of actually covering the unknown parameter $\beta$ never (for no value of $\beta$) falls below a prescribed level $1 - \alpha$ with $\alpha \in [0,1]$. In other words, we need $P_\beta(\beta \in \hat\beta_L - M) \geq 1 - \alpha$ for all $\beta \in \mathbb{R}^p$ (where we stress the dependence of the probability measure on $\beta$ whenever it occurs), so that

$$\inf_{\beta \in \mathbb{R}^p} P_\beta(\beta \in \hat\beta_L - M) \geq 1 - \alpha.$$

In order to achieve this, we need to be able to compute this "infimal" (minimal) coverage probability. Throughout this and the subsequent section we suppose that the errors are normally distributed,

$$\varepsilon \sim N(0, \sigma^2 I_n),$$

an assumption that will be removed for asymptotic results in Section 5. We will show that the minimum occurs when the components of the unknown parameter become large in absolute value by essentially doing the following. We reparametrize the objective function defining the Lasso estimator so that the dependence on the unknown parameter becomes more transparent and easier to handle. We then consider the limiting cases of the objective functions when the components of the unknown parameter vector $\beta$ become large in absolute value (that is, tend to $+\infty$ or $-\infty$). We will see that it is possible to minimize the resulting objective functions explicitly, with minimizers that follow a shifted normal distribution that has the same variance-covariance matrix as the LS estimator and by construction do not depend on the unknown parameter. Finally, we will show that the infimal coverage probability of the proposed sets is indeed "achieved" for one of these finitely many limiting cases.


To state the main theorem, we need several definitions. First we define the reparametrized objective function $Q_n(u) = L_n(\beta + n^{-1/2}u) - L_n(\beta)$, so that $Q_n$ is uniquely minimized at $\hat u_n = n^{1/2}(\hat\beta_L - \beta)$, the estimation error scaled by $n^{1/2}$. Of course, this scaling factor is arbitrary in finite samples, but proves to be of advantage when considering the problem in large samples in Section 5. We can write $Q_n$ as

$$Q_n(u) = u'C_n u - 2u'W_n + 2n^{-1/2} \sum_{j=1}^p \lambda_{n,j} \left[ |u_j + n^{1/2}\beta_j| - |n^{1/2}\beta_j| \right],$$

where $C_n = X'X/n$ and $W_n = n^{-1/2}X'\varepsilon \sim N(0, \sigma^2 C_n)$. Note that for a set $M \subseteq \mathbb{R}^p$ we then have

$$P_\beta(\beta \in \hat\beta_L - n^{-1/2}M) = P_\beta(\hat u_n \in M).$$

The above-mentioned limiting cases of the objective function that we consider are defined as

$$Q_n^d(u) = u'C_n u - 2u'W_n + 2n^{-1/2} \sum_{j=1}^p \lambda_{n,j} d_j u_j, \qquad (1)$$

where $d = (d_1, \dots, d_p)' \in \{-1,1\}^p$. Holding $W_n$ fixed for a moment, we indeed see that

$$Q_n^d(u) = \lim_{\beta_j \to d_j \infty,\; j = 1, \dots, p} Q_n(u).$$

As shorthand notation, we write $\hat u_n^d$ for the unique minimizer of $Q_n^d$. To define the shape that we want to consider for the confidence regions, we introduce the following notation. For $m \in \mathbb{R}^p$, a vector $d \in \{-1,1\}^p$ and a matrix $C \in \mathbb{R}^{p \times p}$, we define

$$A_C^d(m) = \bigcap_{j=1}^p \{z \in \mathbb{R}^p : d_j(Cm)_j \leq d_j(Cz)_j,\; d_j z_j \leq 0\}.$$

The set $A_C^d(m)$ is an intersection of $2p$ half-spaces, $p$ of which determine the orthant the set is located in via the parameter $d$. The other $p$ half-spaces are defined by hyperplanes that intersect at the point $m$. Figure 1 shows one example of such a set. Note that in general, $A_C^d(m)$ could be non-empty also for $\mathrm{sgn}(m) \neq -d$. The sets we consider are determined by the following condition.

Figure 1: The set $A_C^d(m)$ with $m = (1.5, 2)'$ along with the hyperplanes defining the set. The point $m = (1.5, 2)'$ is displayed as a dot.

Condition A. Let $C \in \mathbb{R}^{p \times p}$ be given. We say that a set $M \subseteq \mathbb{R}^p$ satisfies Condition A with matrix $C$ if

$$A_C^d(m) \subseteq M$$

for all $d \in \{-1,1\}^p$ and for all $m \in M$.


The above condition will be discussed in more detail in Section 4. Using this notation, we can now state the main theorem.


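Theorem 1. If $M_n \subseteq \mathbb{R}^p$ is non-random and satisfies Condition A with $C = C_n$, then

$$\inf_{\beta \in \mathbb{R}^p} P_\beta(\hat u_n \in M_n) = \min_{d \in \{-1,1\}^p} P(\hat u_n^d \in M_n),$$

where $\hat u_n^d \sim N(-n^{-1/2}C_n^{-1}\Lambda_n d, \sigma^2 C_n^{-1})$.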
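The normal law appearing in Theorem 1 can be read off the first-order condition of the limiting objective. The short derivation below is only a sketch: it assumes the limiting objective has the form $Q_n^d(u) = u'C_n u - 2u'W_n + 2n^{-1/2}\sum_j \lambda_{n,j} d_j u_j$ (the scaling that reproduces the mean stated in the theorem) and uses that $C_n$ is positive definite because $X$ has full column rank.

```latex
\begin{align*}
\nabla Q_n^d(u) = 2 C_n u - 2 W_n + 2 n^{-1/2} \Lambda_n d = 0
  &\quad\Longrightarrow\quad
  \hat u_n^d = C_n^{-1}\bigl(W_n - n^{-1/2}\Lambda_n d\bigr), \\
W_n \sim N(0, \sigma^2 C_n)
  &\quad\Longrightarrow\quad
  \hat u_n^d \sim N\bigl(-n^{-1/2} C_n^{-1}\Lambda_n d,\; \sigma^2 C_n^{-1}\bigr).
\end{align*}
```

The covariance follows from $C_n^{-1}(\sigma^2 C_n)C_n^{-1} = \sigma^2 C_n^{-1}$, which is also the covariance of the scaled LS estimation error.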

The distributions of $\hat u_n^d$ determining the formula for the infimal coverage probability are shifted normal distributions with the same variance-covariance matrix as the corresponding (shifted and scaled) LS estimator $\hat u_{LS} = n^{1/2}(\hat\beta_{LS} - \beta)$ and a mean that depends on the regressors and the vector of tuning parameters. Since Condition A for $p = 1$ simply requires the corresponding set $M_n$ to be an interval containing zero, Theorem 1 is indeed a generalization of the formula in Theorem 5(a) in Pötscher & Schneider (2010), as discussed in the introduction. (To make the connection, note that the tuning parameter $\eta_n$ in that reference corresponds to a component of the vector of tuning parameters in our paper.) The following obvious corollary specifies the resulting valid confidence region based on the Lasso estimator.
Corollary 2. Let $0 < \alpha < 1$. If $M_n \subseteq \mathbb{R}^p$ is non-random and satisfies Condition A with $C = C_n$, as well as $\min_{d \in \{-1,1\}^p} P(\hat u_n^d \in M_n) = 1 - \alpha$ with $\hat u_n^d \sim N(-n^{-1/2}C_n^{-1}\Lambda_n d, \sigma^2 C_n^{-1})$, then

$$\inf_{\beta \in \mathbb{R}^p} P_\beta(\beta \in \hat\beta_L - n^{-1/2}M_n) = 1 - \alpha.$$
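For $p = 1$ the formula can be evaluated in closed form with the normal distribution function, which gives a quick sanity check of Theorem 1 and Corollary 2. The sketch below assumes a scalar $C_n = c$, an interval $M_n = [\mathrm{lo}, \mathrm{hi}]$ containing zero, and illustrative values for $n$, $\lambda_n$ and $\sigma$; the function names are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def min_coverage_1d(lo, hi, c, lam, n, sigma=1.0):
    """Minimal coverage of M = [lo, hi] (lo <= 0 <= hi) for p = 1.

    Minimizes over d in {-1, 1} the probability that
    N(-lam * d / (sqrt(n) * c), sigma^2 / c) falls into [lo, hi].
    """
    sd = sigma / math.sqrt(c)
    coverages = []
    for d in (-1, 1):
        mean = -lam * d / (math.sqrt(n) * c)
        coverages.append(norm_cdf((hi - mean) / sd) - norm_cdf((lo - mean) / sd))
    return min(coverages)


def symmetric_halfwidth(alpha, c, lam, n, sigma=1.0, tol=1e-10):
    """Bisection for the smallest a such that [-a, a] has minimal coverage 1 - alpha."""
    lo, hi = 0.0, 1.0
    while min_coverage_1d(-hi, hi, c, lam, n, sigma) < 1 - alpha:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if min_coverage_1d(-mid, mid, c, lam, n, sigma) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return hi


# Assumed values: n = 20, c = 1, sigma = 1, lambda_n = sqrt(n) / 2.
a_lasso = symmetric_halfwidth(0.05, c=1.0, lam=math.sqrt(20) / 2, n=20)
a_ls = symmetric_halfwidth(0.05, c=1.0, lam=0.0, n=20)
```

The resulting interval for $\beta$ is $\hat\beta_L - n^{-1/2}[-a, a]$; setting the tuning parameter to zero recovers the usual LS half-width.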


4 Constructing the Confidence Set 


We now turn to the important matter of how to choose an appropriate set $M_n \subseteq \mathbb{R}^p$ for some desired level of confidence $1 - \alpha$ by discussing concrete shapes for the confidence regions as well as their size and relation to confidence sets based on the LS estimator. As mentioned in the previous section, we need to find a set $M_n \subseteq \mathbb{R}^p$ that satisfies Condition A with $C = C_n$ and such that $\min_{d \in \{-1,1\}^p} P(\hat u_n^d \in M_n) = 1 - \alpha$, where

$$\hat u_n^d \sim N(-n^{-1/2}C_n^{-1}\Lambda_n d, \sigma^2 C_n^{-1}).$$

The resulting confidence set for $\beta$ is then the scaled and shifted set $\hat\beta_L - M_n/n^{1/2}$. If we based the set on the LS estimator $\hat\beta_{LS}$ instead of $\hat\beta_L$, the canonical and best choice for $M_n$ in terms of volume would be an ellipse determined by the contour lines of a $N(0, \sigma^2 C_n^{-1})$-distribution, the $C_n$-ellipse. Given that the variance-covariance matrix of the distributions of $\hat u_n^d$ is in fact $\sigma^2 C_n^{-1}$, and that the means of these distributions average to 0, it is reasonable to consider the $C_n$-ellipse as a shape in connection with the Lasso estimator also. As stated in the following proposition, this shape complies with Condition A.

Proposition 3. The $C_n$-ellipse given by

$$E_{C_n}(k) = \{z \in \mathbb{R}^p : z'C_n z \leq k\}$$

satisfies Condition A with $C = C_n$ for any $k > 0$.
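Proposition 3 can be probed numerically. The sketch below, which is an illustration and not a proof, draws random pairs $(m, z)$ in the plane, keeps those with $z \in A_C^d(m)$ for some $d$, and counts how often $z'Cz > m'Cm$; if no such case occurs, every sampled $z$ lies in each ellipse that contains $m$. The matrix, the sampling box and the helper names are assumptions chosen for illustration.

```python
import itertools
import random

C = [[1.0, -0.5], [-0.5, 1.0]]  # assumed positive definite matrix


def matvec(C, v):
    return [sum(C[i][j] * v[j] for j in range(len(v))) for i in range(len(C))]


def quad_form(C, v):
    Cv = matvec(C, v)
    return sum(v[i] * Cv[i] for i in range(len(v)))


def in_A(z, m, d, C):
    """Membership in A_C^d(m): d_j (Cm)_j <= d_j (Cz)_j and d_j z_j <= 0 for all j."""
    Cm, Cz = matvec(C, m), matvec(C, z)
    return all(d[j] * Cm[j] <= d[j] * Cz[j] and d[j] * z[j] <= 0
               for j in range(len(z)))


def ellipse_violations(C, trials=20000, seed=1):
    """Count sampled z in some A_C^d(m) and how many of them leave the ellipse through m."""
    rng = random.Random(seed)
    hits = violations = 0
    for _ in range(trials):
        m = [rng.uniform(-3, 3) for _ in range(2)]
        z = [rng.uniform(-3, 3) for _ in range(2)]
        for d in itertools.product((-1, 1), repeat=2):
            if in_A(z, m, d, C):
                hits += 1
                if quad_form(C, z) > quad_form(C, m) + 1e-12:
                    violations += 1
    return hits, violations


hits, violations = ellipse_violations(C)
```

The check mirrors the argument that $z \in A_C^d(m)$ implies $z'Cz \leq z'Cm$, which together with the Cauchy-Schwarz inequality in the inner product induced by $C$ gives $z'Cz \leq m'Cm$.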

How to choose the parameter $k$ for a given level of coverage $1 - \alpha$ is stated in the next proposition.

Proposition 4. For any $k > 0$, we have that

$$\arg\min_{d \in \{-1,1\}^p} P(\hat u_n^d \in E_{C_n}(k)) = \arg\max_{d \in \{-1,1\}^p} \|C_n^{-1/2}\Lambda_n d\|.$$


Note that if $d^*$ solves the above optimization problem, so does $-d^*$. To finally obtain the confidence ellipse based on the Lasso estimator, pick any such optimizer $d^*$ and compute $k^* > 0$ so that $P(\hat u_n^{d^*} \in E_{C_n}(k^*)) = 1 - \alpha$, which is easily done numerically. Note that Proposition 4 also shows that the ellipse $E_{C_n}(k^*)$, and therefore the resulting confidence set based on the Lasso estimator, is larger in volume than the one based on the LS estimator, since $E_{C_n}(k^*)$ needs to be large enough to have mass $1 - \alpha$ with respect to the $N(-n^{-1/2}C_n^{-1}\Lambda_n d^*, \sigma^2 C_n^{-1})$-measure, whereas for the ellipse corresponding to the LS estimator, it suffices to have mass $1 - \alpha$ with respect to the $N(0, \sigma^2 C_n^{-1})$-measure. Clearly, the difference in size will increase as the tuning parameters become large. These observations are in line with the findings in Pötscher & Schneider (2010), who show that a confidence interval based on the Lasso estimator is larger than a confidence interval based on the LS estimator with the same coverage probability. When comparing the two confidence sets, we emphasize that since the ellipses are centered at different values, the smaller ellipse based on the LS estimator is in general not contained in the ellipse based on the Lasso estimator. This, as well as the difference in volume between the two ellipses, will also be illustrated in the example below.
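The two steps just described (pick a worst-case sign vector, then solve for the level) can be sketched for $p = 2$ as follows. This is an illustration under assumed values ($C_n$, $n$, $\lambda_n$, $\sigma = 1$), and it swaps in a plain Monte Carlo quantile for the numerical solution: under $d^*$, the quantity $\hat u' C_n \hat u / \sigma^2$ follows a noncentral chi-square distribution with $p$ degrees of freedom, so an exact quantile routine could be used instead. All function names are hypothetical.

```python
import itertools
import math
import random


def inv2(C):
    """Inverse of a 2 x 2 matrix."""
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    return [[C[1][1] / det, -C[0][1] / det], [-C[1][0] / det, C[0][0] / det]]


def chol2(S):
    """Lower-triangular L with L L' = S for a 2 x 2 positive definite S."""
    l11 = math.sqrt(S[0][0])
    l21 = S[1][0] / l11
    l22 = math.sqrt(S[1][1] - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]


def worst_direction(C, lam):
    """A sign vector d maximizing ||C^{-1/2} Lambda d||, i.e. d' Lambda C^{-1} Lambda d."""
    Ci = inv2(C)

    def score(d):
        v = [lam[0] * d[0], lam[1] * d[1]]
        return sum(v[i] * Ci[i][j] * v[j] for i in range(2) for j in range(2))

    return max(itertools.product((-1, 1), repeat=2), key=score)


def ellipse_level(C, lam, n, alpha=0.05, sigma=1.0, draws=50000, seed=0):
    """Monte Carlo estimate of k* with P(u' C u <= k*) = 1 - alpha under the worst d."""
    rng = random.Random(seed)
    Ci = inv2(C)
    L = chol2(Ci)  # u has covariance sigma^2 * C^{-1}
    d = worst_direction(C, lam)
    mean = [-sum(Ci[i][j] * lam[j] * d[j] for j in range(2)) / math.sqrt(n)
            for i in range(2)]
    q = []
    for _ in range(draws):
        e = [rng.gauss(0, 1), rng.gauss(0, 1)]
        u = [mean[i] + sigma * sum(L[i][j] * e[j] for j in range(2)) for i in range(2)]
        q.append(sum(u[i] * C[i][j] * u[j] for i in range(2) for j in range(2)))
    q.sort()
    return d, q[int(math.ceil((1 - alpha) * draws)) - 1]


C = [[1.0, -0.5], [-0.5, 1.0]]
n = 20
lam = [math.sqrt(n) / 2] * 2
d_star, k_star = ellipse_level(C, lam, n)
k_ls = -2.0 * math.log(0.05)  # 95% quantile of chi-square(2): LS ellipse for sigma = 1
```

Comparing `k_star` with `k_ls` shows how much the Lasso-based ellipse has to grow relative to the LS ellipse for the same nominal level.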


The $C_n$-ellipse is clearly not optimal as a shape for confidence sets based on the Lasso estimator, since we can get higher coverage with a set of the same volume by adjusting the ellipse "towards" the contour lines of the $N(-n^{-1/2}C_n^{-1}\Lambda_n d, \sigma^2 C_n^{-1})$-distributions (in such a way that Condition A is preserved). To find the best shape possible, one would have to minimize the volume of the set over all possible shapes satisfying Condition A subject to the constraint of holding the prescribed minimal coverage probability. This is a highly complex optimization problem and we do not dwell further on this subject here, but illustrate possible ways to construct "good" sets, as shown in the example below. Before discussing this further, note that the following proposition shows that it is easy to find the closure of an arbitrary subset of $\mathbb{R}^p$ with respect to Condition A.
Figure 2: The confidence ellipses based on and centered at the Lasso estimator $\hat\beta_L = (1.15, 0)'$ (red) and the smaller one based on and centered at the LS estimator $\hat\beta_{LS} = (1.35, 0.17)'$ (blue), respectively.


Proposition 5. For any $M \subseteq \mathbb{R}^p$, the set

$$\bigcup_{m \in M} \; \bigcup_{d \in \{-1,1\}^p} A_C^d(m)$$

is the smallest set containing $M$ that satisfies Condition A.

We now provide an example for $p = 2$ illustrating the difference between the confidence ellipse based on the LS estimator and the one based on the Lasso, as well as how to choose a better shape in terms of volume for the confidence set based on the Lasso estimator. The simulations and calculations were carried out using the statistical software package R. The example is set up in the following way. We let $n = 20$ and generate the $(n \times 2)$-matrix $X$ using independent and identically distributed standard normal entries that are transformed row-wise by an appropriate $(2 \times 2)$-matrix in order to get

$$C_n = \frac{X'X}{n} = \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}.$$

We generate the data vector $y$ from the corresponding linear model with $\sigma^2 = 1$ (so that $\varepsilon \sim N(0, I_n)$) and true parameter chosen as $\beta = (1, 0)'$. We compute the Lasso estimator using the glmnet-package and tuning parameters $\lambda_{n,1} = \lambda_{n,2} = \sqrt{n}/2$ (asymptotically corresponding to what we will refer to as conservative model selection in the subsequent section). We also considered estimators where the tuning parameters were chosen by 10-fold cross-validation (as provided in the glmnet-package), which ended up yielding comparable results for the estimator.
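The text does not spell out which row-wise transformation was used. One possible choice, sketched below in Python rather than R, combines two Cholesky factors so that the transformed matrix has exactly the prescribed Gram matrix; the helper names and the seed are illustrative assumptions.

```python
import math
import random


def chol2(S):
    """Lower-triangular L with L L' = S for a 2 x 2 positive definite S."""
    l11 = math.sqrt(S[0][0])
    l21 = S[1][0] / l11
    l22 = math.sqrt(S[1][1] - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]


def gram(X):
    """Return X'X / n for a matrix given as a list of rows with two columns."""
    n = len(X)
    return [[sum(row[a] * row[b] for row in X) / n for b in range(2)] for a in range(2)]


def design_with_gram(n, C, seed=0):
    """Draw Z with i.i.d. N(0, 1) entries and return X = Z T with X'X / n = C."""
    rng = random.Random(seed)
    Z = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    Ls = chol2(gram(Z))   # Z'Z / n = Ls Ls'
    Lc = chol2(C)         # C = Lc Lc'
    a, b, c = Ls[0][0], Ls[1][0], Ls[1][1]
    Ls_inv = [[1.0 / a, 0.0], [-b / (a * c), 1.0 / c]]
    # T = Ls^{-T} Lc', so that T' (Z'Z / n) T = Lc Lc' = C.
    T = [[sum(Ls_inv[k][i] * Lc[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
    return [[sum(z[i] * T[i][j] for i in range(2)) for j in range(2)] for z in Z]


C = [[1.0, -0.5], [-0.5, 1.0]]
X = design_with_gram(20, C)
G = gram(X)
```

Any other matrix square root of the sample Gram matrix would work equally well; the Cholesky route is chosen here only because it is easy to write down for two columns.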

We then constructed confidence ellipses with level $\alpha = 0.05$ based on both the LS and the Lasso estimator in the manner described earlier in this section. The resulting sets are shown in Figure 2. The plot clearly illustrates the above-described fact that the confidence ellipse based on the Lasso estimator is larger than the confidence ellipse that is based on the LS estimator. Also, the two sets overlap by a large amount (in fact, the maximal distance between the two estimators is controlled by Proposition 13 in the Appendix). However, the LS ellipse is not entirely contained in the one based on the Lasso, stressing the fact that Theorem 1 yields non-trivial sets.

The above comparison between the two ellipses, however, is somewhat unfair in the sense that the shape used for both confidence sets is the optimal one (in terms of volume) for the LS estimator, but, as discussed above, not for the Lasso estimator. With the optimal shape for a Lasso confidence set being unknown, we at least want to find a shape that improves upon the ellipse. As a basis for this, we consider the union of the contour sets corresponding to the distributions of $\hat u_n^d$, that is, the $2^p$ shifted $C_n$-ellipses

$$U_n(k) = \bigcup_{d \in \{-1,1\}^p} \left( E_{C_n}(k) - n^{-1/2}C_n^{-1}\Lambda_n d \right),$$

where each set in the union is of optimal shape for the corresponding distribution of $\hat u_n^d$. As a starting point, we choose $k$ so that $P(\hat u_n^d \in E_{C_n}(k) - n^{-1/2}C_n^{-1}\Lambda_n d) = 1 - \alpha$ (note that $k$ is then simply the parameter of the $C_n$-ellipse used for the LS estimator, but any $k > 0$ such that $U_n(k)$ satisfies $P(\hat u_n^d \in U_n(k)) \geq 1 - \alpha$ works). Clearly, this set is still too large and will not satisfy Condition A, so we need to address these two issues. First, we add all points necessary so that the resulting set satisfies Condition A. Proposition 5 ensures that

$$\bigcup_{m \in U_n(k)} \; \bigcup_{d \in \{-1,1\}^p} A_{C_n}^d(m)$$

fulfills the desired condition. Note that in this particular case, it is fairly straightforward to see that this set is simply given by the convex hull of the shifted ellipses $U_n(k)$. Finally, to get the smallest set with this shape that still holds the prescribed level of coverage, we iteratively adjust the set by reducing the parameter $k$ and re-calculating the minimal coverage probability of the resulting set until the desired minimal coverage probability is reached (up to an arbitrary level of precision). The resulting alternatively shaped set is depicted in Figure 3, with panel (a) showing the midpoints of the $2^p = 4$ ellipses used in the construction and panel (b) displaying the new confidence set on top of the elliptic confidence region based on the Lasso as devised before. The new shape has slightly less volume than the ellipse.

Figure 3: (a) Construction of the alternative shape based on $2^p = 4$ ellipses with their centers displayed as dots. (b) The resulting improved confidence set with the alternative shape (blue) and the previous elliptic shape (red), both based on and centered at the Lasso estimator $\hat\beta_L = (1.15, 0)'$.


5 Asymptotic Framework 

We now derive asymptotic results that hold without assuming normality of the errors. In addition to the assumptions in Section 2, for all asymptotic considerations, we assume that $X = (x_1, \dots, x_n)'$ where $x_i \in \mathbb{R}^p$, meaning that the regressor matrix $X$ changes with $n$ only by appending rows, and that

$$C_n = \frac{X'X}{n} \to C$$

as $n \to \infty$, where $C$ is finite and positive definite. This setting assures consistency and asymptotic normality of the LS estimator. We will consider two different regimes of the asymptotic behavior of the tuning parameter $\lambda_n$ and start with the regime we call conservative tuning.


5.1 Conservative Tuning 

In this regime and throughout this subsection, we require that

$$\frac{\lambda_n}{n^{1/2}} \to \lambda \in [0, \infty)^p$$

as $n \to \infty$. This implies that $\lambda_{n,j}/n \to 0$ for all $j = 1, \dots, p$, which in turn implies consistency of $\hat\beta_L$ (see Theorem 1 in Knight & Fu (2000) with the slight modification that in our paper we allow for componentwise defined tuning parameters). We let $\Lambda = \mathrm{diag}(\lambda)$.

Remark 1. Such a choice of tuning parameters indeed yields a conservative model selection procedure in the sense that

$$\limsup_{n \to \infty} \sup_{\beta \in \mathbb{R}^p} P_\beta(\hat\beta_{L,j} = 0) < 1 \qquad (2)$$

for each $j = 1, \dots, p$. In particular, if $\beta_j = 0$, we have

$$\limsup_{n \to \infty} P_\beta(\hat\beta_{L,j} = 0) < 1.$$

The latter statement was also noted by Zou (2006) in Proposition 1.

The following proposition implicitly states the asymptotic distribution of the estimator in a so-called moving-parameter framework. This proposition essentially is Theorem 5 from Knight & Fu (2000) and can be proven in the same manner simply by adjusting for componentwise tuning.

Proposition 6. Assume that $n^{1/2}\beta_n \to t \in \bar{\mathbb{R}}^p$. Then $n^{1/2}(\hat\beta_L - \beta_n) \xrightarrow{d} \hat u = \arg\min_{u \in \mathbb{R}^p} Q(u)$, where

$$Q(u) = u'Cu - 2W'u + 2\sum_{j=1}^p \lambda_j \left[ 1_{\{t_j \in \mathbb{R}\}} (|t_j + u_j| - |t_j|) + 1_{\{|t_j| = \infty\}} \mathrm{sgn}(t_j) u_j \right] \qquad (3)$$

and $W \sim N(0, \sigma^2 C)$.

Note that the vector $t$ takes over the role of $n^{1/2}\beta$ in the finite-sample version of the function, $Q_n$, where the cases of $t_j = \pm\infty$ are now included in the asymptotic setting. Also, the assumption of $n^{1/2}\beta_n$ converging in $\bar{\mathbb{R}}^p$ is not a restriction in the sense that, by compactness of $\bar{\mathbb{R}}^p$, Proposition 6 characterizes all accumulation points of the distributions (with respect to weak convergence) corresponding to completely arbitrary sequences of $\beta_n$.

Similarly to the finite-sample case, we define $\hat u$ to be the unique minimizer of $Q$, and for $d \in \{-1,1\}^p$, we define $Q^d(u) = u'Cu - 2W'u + 2\sum_{j=1}^p \lambda_j d_j u_j$ with unique minimizer $\hat u^d$. We can then formulate an asymptotic version of Theorem 1.

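Theorem 7. If $M \subseteq \mathbb{R}^p$ satisfies Condition A with matrix $C$, then

$$\inf_{t \in \bar{\mathbb{R}}^p} P_t(\hat u \in M) = \min_{d \in \{-1,1\}^p} P(\hat u^d \in M),$$

where $\hat u^d \sim N(-C^{-1}\Lambda d, \sigma^2 C^{-1})$.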

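Given this result we can, again, construct asymptotically valid confidence sets for the parameter $\beta$ in the following way.

Corollary 8. If $M \subseteq \mathbb{R}^p$ satisfies Condition A with matrix $C$ and $\min_{d \in \{-1,1\}^p} P(\hat u^d \in M) = 1 - \alpha$, where $\hat u^d \sim N(-C^{-1}\Lambda d, \sigma^2 C^{-1})$, then

$$\liminf_{n \to \infty} \inf_{\beta \in \mathbb{R}^p} P_\beta(\beta \in \hat\beta_L - n^{-1/2}M) = 1 - \alpha.$$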


We find that asymptotically in the case of conservative tuning, we essentially get the same results as in finite samples when assuming normally distributed errors. The only difference is that the minimal coverage holds asymptotically and that the quantities $C_n$ and $n^{-1/2}\Lambda_n$ have settled to their limiting values $C$ and $\Lambda$, respectively.

5.2 Consistent Tuning 

In the second regime and throughout this subsection, we suppose that

$$\frac{\lambda_{n,j}}{n^{1/2}} \to \infty$$

for at least one $j$ with $1 \leq j \leq p$ as well as

$$\frac{\lambda_{n,j}}{n} \to 0$$

for all $j = 1, \dots, p$ as $n \to \infty$, where the latter condition ensures estimation consistency of the estimator. We refer to this regime as consistent tuning to highlight the contrast to conservative tuning where $\lambda_{n,j}/n^{1/2}$ converges for each $j = 1, \dots, p$. Yet we emphasize that in order to ensure $P_\beta(\hat\beta_{L,j} = 0) \to 1$ whenever $\beta_j = 0$, we would need $\lambda_{n,j}/n^{1/2} \to \infty$ for each $j = 1, \dots, p$ as well as additional conditions on the regressor matrix $X$. We refer the reader to Zou (2006), Zhao & Yu (2006) and Yuan & Lin (2007) for a discussion concerning necessary and sufficient conditions on $X$ in this context.

In the case of consistent tuning, the rate of the estimator is no longer $n^{-1/2}$, neither when looked at in a fixed-parameter asymptotic framework (as has been noted by Zou (2006) in Lemma 3), nor (a fortiori) within a moving-parameter asymptotic framework, as discussed in Pötscher & Leeb (2009) in Theorem 2. The latter reference shows that the correct (uniform) convergence rate depends on the sequence of tuning parameters $\lambda_n$. Since we allow for componentwise tuning, in fact, the rate depends on the largest component of the vector of tuning parameters, as can be seen from the following proposition. We define

$$\lambda_n^* = \max_{1 \leq j \leq p} \lambda_{n,j},$$

and $\lambda_0 = (\lambda_{0,1}, \dots, \lambda_{0,p})'$ by

$$\lambda_{n,j}/\lambda_n^* \to \lambda_{0,j} \in [0,1]$$

for each $j = 1, \dots, p$ as $n \to \infty$. Note that $\lambda_{0,j} = 1$ for all $j$ in case all components are equally tuned.


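Proposition 9. Assume that $n\beta_n/\lambda_n^* \to \zeta \in \bar{\mathbb{R}}^p$. Then $n(\hat\beta_L - \beta_n)/\lambda_n^* \xrightarrow{p} m = \arg\min_{u \in \mathbb{R}^p} V^\zeta(u)$, where

$$V^\zeta(u) = u'Cu + 2\sum_{j=1}^p \lambda_{0,j} \left[ 1_{\{\zeta_j \in \mathbb{R}\}} (|u_j + \zeta_j| - |\zeta_j|) + 1_{\{|\zeta_j| = \infty\}} \mathrm{sgn}(\zeta_j) u_j \right].$$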

(In contrast to the finite-sample and the conservative case, we make the dependence of the objective function $V^\zeta$ on the unknown parameter $\zeta \in \bar{\mathbb{R}}^p$ apparent in the notation to clarify what we do in the following.) Proposition 9 shows that $\lambda_n^*/n$ is indeed the correct (uniform) convergence rate, as the limit of $n(\hat\beta_L - \beta)/\lambda_n^*$ is not 0 in general. The proposition also reveals that in the consistently tuned case, when scaled according to the correct convergence rate, the limit of the sequence of estimators is always non-random, a fact that in a moving-parameter asymptotic framework has already been noted in the one-dimensional case in Pötscher & Leeb (2009). This fact allows us to construct very simple confidence sets in the case of consistent tuning by first observing that the limit of $n(\hat\beta_L - \beta)/\lambda_n^*$ is always contained in a bounded set which is described in Proposition 10. To this end, define the set

$$\mathcal{M} = \bigcup_{\zeta \in \bar{\mathbb{R}}^p} \arg\min_{u \in \mathbb{R}^p} V^\zeta(u) \qquad (4)$$

and note that the following can be shown.
Proposition 10. The set $\mathcal{M}$ can be written as

$$\{m \in \mathbb{R}^p : |(Cm)_j| \leq \lambda_{0,j},\; 1 \leq j \leq p\} = C^{-1}\{z \in \mathbb{R}^p : |z_j| \leq \lambda_{0,j},\; 1 \leq j \leq p\}.$$


Thus $\mathcal{M}$ can be viewed as a box distorted by the linear function $C^{-1}$, a bounded set in $\mathbb{R}^p$. In fact, this turns out to be a parallelogram whose corner points are given by the set $\{C^{-1}\Lambda_0 d : d \in \{-1,1\}^p\}$, where $\Lambda_0 = \mathrm{diag}(\lambda_0)$. Note that fittingly, these corner points can be viewed as the equivalent of the means in the normal distributions (determining the minimal coverage probability) in the conservative case in Theorem 7, appearing without randomness in the limit in the consistently tuned case. Using Proposition 10, a simple asymptotic confidence set can now be constructed as is done in the following corollary.


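Corollary 11. We have

$$\lim_{n \to \infty} \inf_{\beta \in \mathbb{R}^p} P_\beta\left(\beta \in \hat\beta_L - d\,\frac{\lambda_n^*}{n}\,\mathcal{M}\right) = 1$$

for any $d > 1$ and

$$\lim_{n \to \infty} \inf_{\beta \in \mathbb{R}^p} P_\beta\left(\beta \in \hat\beta_L - d\,\frac{\lambda_n^*}{n}\,\mathcal{M}\right) = 0$$

for any $d < 1$.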


Note that nothing can be said about the boundary case $d = 1$. This corollary is a generalization of the simple confidence interval given in Proposition 6 in Pötscher & Schneider (2010). Finally, also note that the set $\mathcal{M}$ is not required to satisfy Condition A and, in fact, will not comply with this condition for certain matrices $C$.
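The set described in Proposition 10 and the confidence set of Corollary 11 are straightforward to evaluate. The sketch below works for $p = 2$ with an assumed matrix $C$ and $\lambda_0 = (1, 1)'$; it lists the corner points $C^{-1}\Lambda_0 d$ and tests whether a candidate $\beta$ lies in $\hat\beta_L - d(\lambda_n^*/n)\mathcal{M}$, which is equivalent to $|(C(\hat\beta_L - \beta))_j| \leq d\,\lambda_n^*\lambda_{0,j}/n$ for all $j$. The function names and all numerical values are hypothetical.

```python
import itertools


def inv2(C):
    """Inverse of a 2 x 2 matrix."""
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    return [[C[1][1] / det, -C[0][1] / det], [-C[1][0] / det, C[0][0] / det]]


def matvec(C, v):
    return [sum(C[i][j] * v[j] for j in range(len(v))) for i in range(len(C))]


def corners(C, lam0):
    """Corner points C^{-1} Lambda_0 d of the parallelogram, d in {-1, 1}^2."""
    Ci = inv2(C)
    return [matvec(Ci, [lam0[0] * d[0], lam0[1] * d[1]])
            for d in itertools.product((-1, 1), repeat=2)]


def in_scaled_M(v, C, lam0, scale):
    """True if v lies in scale * M, where M = {m : |(C m)_j| <= lam0_j for all j}."""
    Cv = matvec(C, v)
    return all(abs(Cv[j]) <= scale * lam0[j] + 1e-12 for j in range(len(v)))


def covers(beta, beta_hat, C, lam0, lam_star, n, d=1.1):
    """Check whether beta lies in beta_hat - d * (lam_star / n) * M."""
    diff = [beta_hat[j] - beta[j] for j in range(len(beta))]
    return in_scaled_M(diff, C, lam0, d * lam_star / n)


C = [[1.0, -0.5], [-0.5, 1.0]]   # assumed limit matrix
lam0 = [1.0, 1.0]                # equally tuned components
pts = corners(C, lam0)
inside = covers([1.0, 0.0], [1.02, 0.0], C, lam0, lam_star=4.0, n=100)
```

In applications the limit $C$ is not available, so using $C_n$ in its place is a further assumption of this sketch and not something the corollary itself states.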


6 Conclusion 

We consider confidence regions based on the Lasso estimator covering the entire unknown parameter 
vector thereby quantifying estimation uncertainty of this estimator. We provide exact formulas for the 
minimal coverage probability of these regions in finite samples and asymptotically in a low-dimensional 
framework when the estimator is tuned to perform conservative model selection. We do this without 
explicit knowledge of the distribution but by carefully exploiting the structure of the optimization problem 
that defines the estimator. The sets we consider as confidence regions need to satisfy certain shape 
constraints which apply to the regular confidence ellipse based on the LS estimator. We show that the 
LS confidence ellipse is always smaller than the one based on the Lasso estimator, but not contained 
in the Lasso ellipse in general. An ellipse is not the optimal shape for the confidence region based on 
the Lasso estimator in terms of volume. We give some guidelines on how to construct regions of smaller 
volume. We show how a set can be minimally enlarged in order to comply with the imposed shape
condition, which allows the construction to start from sets of arbitrary shape.

In the consistently tuned case, we give a simple asymptotic confidence region in the shape of a
parallelogram that is determined by the regressor matrix.
7 Proofs

We start the proof section by introducing some notation that will be used throughout this section. Let $e_j$ denote the $j$-th unit vector in $\mathbb{R}^p$ and let $\iota = (1, \dots, 1)' \in \mathbb{R}^p$. For a vector $d \in \{-1,1\}^p$, we define $O^d$ to be the corresponding orthant of $\mathbb{R}^p$, that is, $O^d = \{z \in \mathbb{R}^p : d_j z_j > 0\}$, and $\bar O^d$ to be the corresponding orthant of $\bar{\mathbb{R}}^p$, that is, $\bar O^d = \{z \in \bar{\mathbb{R}}^p : d_j z_j > 0\}$. By $O^\iota$ we denote the orthant with strictly positive components only, that is, $O^\iota = \{z \in \mathbb{R}^p : z_j > 0\}$. The sup-norm on $\mathbb{R}^p$ is denoted by $\|\cdot\|_\infty$.

To remind the reader of some notation relevant for the following proofs that was introduced previously
throughout the paper, note that $\hat u_n = n^{1/2}(\hat\beta_L - \beta)$, where $\hat u_n$ is the minimizer of $Q_n$, and $\hat u_{LS} =
n^{1/2}(\hat\beta_{LS} - \beta)$. The minimizer of $Q_n^d$ was labeled $\hat u_n^d$. The asymptotic versions in the conservatively tuned
case were labeled $\hat u$ and $Q$, as well as $\hat u^d$ and $Q^d$, respectively.

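The directional derivative of a function $g : \mathbb{R}^p \to \mathbb{R}$ at $u$ in the direction of $r \in \mathbb{R}^p \setminus \{0\}$ is defined as

\[ \frac{\partial g(u)}{\partial r} = \lim_{h \downarrow 0} \frac{g(u + hr) - g(u)}{h}. \]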


7.1 Proofs for Section 3

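In order to prove the main theorem, we start by re-writing Condition A. For $m \in \mathbb{R}^p$ and a $p \times p$ matrix
$C$, we define

\[ A^{d_j}_{C,j}(m) = \{z \in \mathbb{R}^p : d_j(Cm)_j \le d_j(Cz)_j,\ d_j z_j \le 0\} \quad \text{and} \]

\[ B^{d_j}_{C,j}(m) = \{z \in \mathbb{R}^p : (Cz)_j = (Cm)_j,\ d_j z_j > 0\} \]

for $j = 1,\dots,p$. Note that clearly we have

\[ A^d_C(m) = \bigcap_{j=1}^p A^{d_j}_{C,j}(m) \]

and that, in fact, also the following lemma holds.

Lemma 12.

\[ \bigcup_{d \in \{-1,1\}^p} \bigcap_{j=1}^p A^{d_j}_{C,j}(m) = \bigcup_{d \in \{-1,1\}^p} \bigcap_{j=1}^p \left[ A^{d_j}_{C,j}(m) \cup B^{d_j}_{C,j}(m) \right]. \]
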
Proof. We fix $m$ and $C$, drop the corresponding subscripts and show that the set on the left-hand side
of the equation contains the set on the right-hand side of the equation; the converse inclusion is immediate. To this end, take any $z$ from
the set on the right-hand side. Then there exists a $d \in \{-1,1\}^p$ such that for each $j = 1,\dots,p$, $z$ is either
contained in $A^{d_j}_j$ or in $B^{d_j}_j$. We pick $f \in \{-1,1\}^p$ in the following way: if $z \in A^{d_j}_j$, set $f_j = d_j$, and if
$z \in B^{d_j}_j$, set $f_j = -d_j$. Then, by construction, $z \in A^{f_j}_j$ for all $j = 1,\dots,p$ and therefore $z \in A^f$, so
that $z$ is contained in the set on the left-hand side of the equation. □
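
The sign-flipping step in the proof of Lemma 12 can be illustrated with a small Python sketch. The function names (`in_A`, `in_B`, `flip_signs`) and the numerical example are illustrative assumptions and do not appear in the paper; integer inputs are used so that the equality defining the set $B$ can be tested exactly.

```python
def row_dot(C, v, j):
    """Return the j-th component of the matrix-vector product C v."""
    return sum(C[j][k] * v[k] for k in range(len(v)))


def in_A(z, m, C, s, j):
    """z lies in A_j^s(m): s*(Cm)_j <= s*(Cz)_j and s*z_j <= 0."""
    return s * row_dot(C, m, j) <= s * row_dot(C, z, j) and s * z[j] <= 0


def in_B(z, m, C, s, j):
    """z lies in B_j^s(m): (Cz)_j == (Cm)_j and s*z_j > 0."""
    return row_dot(C, z, j) == row_dot(C, m, j) and s * z[j] > 0


def flip_signs(z, m, C, d):
    """Build f as in the proof: keep d_j if z is in A_j, flip it if z is in B_j."""
    f = []
    for j in range(len(d)):
        if in_A(z, m, C, d[j], j):
            f.append(d[j])
        elif in_B(z, m, C, d[j], j):
            f.append(-d[j])
        else:
            raise ValueError("z is in neither A_j nor B_j for component %d" % j)
    return f


# Hypothetical example: z is in B_1 (first component) and in A_2 (second component) for d = (1, 1).
C = [[2, 1], [1, 2]]
m = [2, -4]
z = [1, -2]
d = [1, 1]

f = flip_signs(z, m, C, d)
z_in_A_f = all(in_A(z, m, C, f[j], j) for j in range(len(f)))
print(f, z_in_A_f)
```

In this example the first sign is flipped because $z$ satisfies the equality constraint of $B$ with a strictly positive first component, after which $z$ satisfies the defining inequalities of $A^{f_j}_j$ for every $j$.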


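Since it is needed later on, we also prove the following proposition, which quantifies the maximal distance
between the Lasso and the LS estimator in finite samples.

Proposition 13. For each $j = 1,\dots,p$, we have

\[ |(C_n(\hat\beta_L - \hat\beta_{LS}))_j| \le \lambda_{n,j}/n \]

or, equivalently,

\[ |(C_n(\hat u_n - \hat u_{LS}))_j| \le n^{-1/2}\lambda_{n,j}, \]

where $\hat u_{LS} = n^{1/2}(\hat\beta_{LS} - \beta)$.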

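Proof. The two inequalities above just differ by a scaling factor. We show the latter one. We have
$W_n = n^{-1/2}X'\varepsilon = C_n\hat u_{LS}$. Consider the directional derivative of $Q_n$ at its minimizer $\hat u_n$ in the direction
of $e_j$ and $-e_j$. We have

\[ 0 \le \frac{\partial Q_n(\hat u_n)}{\partial e_j} = 2(C_n\hat u_n)_j - 2W_{n,j} + 2n^{-1/2}\lambda_{n,j}\left[ 1_{\{\hat u_{n,j} + n^{1/2}\beta_j \ge 0\}} - 1_{\{\hat u_{n,j} + n^{1/2}\beta_j < 0\}} \right] \]
\[ \le 2(C_n\hat u_n)_j - 2(C_n\hat u_{LS})_j + 2n^{-1/2}\lambda_{n,j} \]

as well as

\[ 0 \le \frac{\partial Q_n(\hat u_n)}{\partial(-e_j)} = -2(C_n\hat u_n)_j + 2W_{n,j} + 2n^{-1/2}\lambda_{n,j}\left[ 1_{\{\hat u_{n,j} + n^{1/2}\beta_j \le 0\}} - 1_{\{\hat u_{n,j} + n^{1/2}\beta_j > 0\}} \right] \]
\[ \le -2(C_n\hat u_n)_j + 2(C_n\hat u_{LS})_j + 2n^{-1/2}\lambda_{n,j}. \]

Piecing the two displays above together yields the second inequality in the proposition. □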
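
The bound of Proposition 13 can be checked numerically. The following Python sketch assumes that the Lasso objective is $\|y - X\beta\|^2 + 2\sum_j \lambda_{n,j}|\beta_j|$ (the form consistent with $Q_n$ above) and uses a plain coordinate-descent solver, which is a stand-in chosen for illustration and not a method taken from the paper. Since $X'X\hat\beta_{LS} = X'y$, the quantity $|(X'X(\hat\beta_L - \hat\beta_{LS}))_j|$ equals $|(X'(y - X\hat\beta_L))_j|$, so the LS estimator never has to be computed; dividing by $n^{1/2}$ gives the scaling used in the proposition. All function names and the simulated data are hypothetical.

```python
import random


def soft_threshold(a, t):
    """Soft-thresholding operator: shrink a towards zero by t."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def lasso_cd(X, y, lam, sweeps=2000):
    """Coordinate descent for ||y - X b||^2 + 2 * sum_j lam[j] * |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X beta
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # x_j' (partial residual that excludes coordinate j)
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam[j]) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def ls_gaps(X, y, beta):
    """|(X'X (beta - beta_LS))_j| computed as |(X'(y - X beta))_j|."""
    n, p = len(X), len(X[0])
    resid = [y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)]
    return [abs(sum(X[i][j] * resid[i] for i in range(n))) for j in range(p)]


# Simulated data (hypothetical values, fixed seed for reproducibility).
rng = random.Random(0)
n, p = 20, 3
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
beta_true = [1.0, 0.0, -0.5]
y = [sum(X[i][k] * beta_true[k] for k in range(p)) + rng.gauss(0.0, 1.0) for i in range(n)]
lam = [2.0, 2.0, 2.0]

beta_lasso = lasso_cd(X, y, lam)
gaps = ls_gaps(X, y, beta_lasso)
bound_holds = all(g <= l + 1e-6 for g, l in zip(gaps, lam))
print(beta_lasso, gaps, bound_holds)
```

For components with a non-zero Lasso estimate the bound is attained with equality, which is the first-order condition used in the proof.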
To proceed, note that $Q_n^d$, as defined earlier, is a simple quadratic and strictly convex function in $u$ with
unique minimizer given by

\[ \hat u_n^d = C_n^{-1}(W_n - n^{-1/2}\Lambda_n d), \qquad (5) \]

where $W_n \sim N(0, \sigma^2 C_n)$. We first show Theorem 1 for one orthant of the parameter space, as is
formulated in Proposition 14.

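Proposition 14. If $M_n \subseteq \mathbb{R}^p$ satisfies that

\[ \bigcap_{j=1}^p \left[ A^{1}_{C_n,j}(m) \cup B^{1}_{C_n,j}(m) \right] \subseteq M_n \]

for all $m \in M_n$, then

\[ \inf_{\beta \in O^{\iota}} P_\beta(\hat u_n \in M_n) = P(\hat u_n^{\iota} \in M_n). \]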


In essence, Proposition 14 states Theorem 1 for the orthant of the parameter space where all
components of $\beta$ are non-negative. The condition in Proposition 14 takes the role of Condition A for the
corresponding orthant, as will become apparent later on in the proof of Theorem 1.


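Proof of Proposition 14. We first show that $P_\beta(\hat u_n \in M_n) \ge P(\hat u_n^{\iota} \in M_n)$ by showing that for
each fixed $\omega \in \Omega$, $\hat u_n^{\iota} \in M_n$ implies that $\hat u_n \in M_n$ as long as $\beta_j \ge 0$ for all $j$. For this, we first show the
following two facts.

(a) $(C_n\hat u_n^{\iota})_j \le (C_n\hat u_n)_j$ for all $j = 1,\dots,p$.

Suppose there exists a $j_0$ with $1 \le j_0 \le p$ such that $(C_n\hat u_n^{\iota})_{j_0} > (C_n\hat u_n)_{j_0}$ and note that by (5) we have
$(C_n\hat u_n^{\iota})_j = W_{n,j} - n^{-1/2}\lambda_{n,j}$ for each $j = 1,\dots,p$. Now consider the directional derivative of $Q_n$
at its minimizer $\hat u_n$ in direction $e_{j_0}$:

\[ \frac{\partial Q_n(\hat u_n)}{\partial e_{j_0}} = 2(C_n\hat u_n)_{j_0} - 2W_{n,j_0} + 2n^{-1/2}\lambda_{n,j_0}\left[ 1_{\{\hat u_{n,j_0} + n^{1/2}\beta_{j_0} \ge 0\}} - 1_{\{\hat u_{n,j_0} + n^{1/2}\beta_{j_0} < 0\}} \right] \]
\[ \le 2(C_n\hat u_n)_{j_0} - 2W_{n,j_0} + 2n^{-1/2}\lambda_{n,j_0} \]
\[ = 2(C_n\hat u_n)_{j_0} - 2(C_n\hat u_n^{\iota})_{j_0} < 0, \]

which is a contradiction to $\hat u_n$ minimizing $Q_n$.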

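(b) $\hat u_{n,j} > 0$ implies $(C_n\hat u_n)_j = (C_n\hat u_n^{\iota})_j$ for any $1 \le j \le p$.

If $\hat u_{n,j} > 0$ (and hence $\hat u_{n,j} + n^{1/2}\beta_j > 0$ when $\beta_j \ge 0$), then $Q_n$ is partially differentiable at $\hat u_n$
with respect to the $j$-th component. Therefore, we have

\[ 0 = \frac{\partial Q_n(\hat u_n)}{\partial u_j} = 2(C_n\hat u_n)_j - 2W_{n,j} + 2n^{-1/2}\lambda_{n,j} = 2(C_n\hat u_n)_j - 2(C_n\hat u_n^{\iota})_j. \]

Now, by Facts (a) and (b) we clearly have that $\hat u_n \in A^{1}_{C_n,j}(\hat u_n^{\iota}) \cup B^{1}_{C_n,j}(\hat u_n^{\iota})$ for each $j = 1,\dots,p$. So, by assumption, $\hat u_n^{\iota} \in M_n$
clearly implies $\hat u_n \in M_n$ as long as $\beta_j \ge 0$ for all $j$. We have therefore shown that

\[ P_\beta(\hat u_n \in M_n) \ge P(\hat u_n^{\iota} \in M_n). \]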


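To see the reverse inequality, note that if $\hat u_{n,j} + n^{1/2}\beta_j > 0$ for all $j$, then $Q_n$ is differentiable at $\hat u_n$
and

\[ 0 = \frac{\partial Q_n(\hat u_n)}{\partial u} = 2C_n\hat u_n - 2W_n + 2n^{-1/2}\lambda_n = 2C_n\hat u_n - 2C_n\hat u_n^{\iota}, \]

implying that $\hat u_n = \hat u_n^{\iota}$. Also note that $\hat u_{n,j} + n^{1/2}\beta_j > 0$ for each $j$ is equivalent to $\hat\beta_L \in O^{\iota}_{+}$, so that

\[ \{\hat u_n \in M_n\} \subseteq \{\hat u_n^{\iota} \in M_n\} \cup \{\hat\beta_L \notin O^{\iota}_{+}\}. \]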








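Now let $K$ be a bound in the sup-norm on the set $\{z \in \mathbb{R}^p : \|C_n z\|_\infty \le n^{-1/2}\|\lambda_n\|_\infty\}$ and, for an arbitrary
$\varepsilon > 0$, pick $\beta^* \in O^{\iota}_{+}$ such that $P(\hat u_{LS} \not> K\iota - n^{1/2}\beta^*) \le \varepsilon$, where $\hat u_{LS} = n^{1/2}(\hat\beta_{LS} - \beta^*) \sim N(0, \sigma^2 C_n^{-1})$
and $x \not> y$ means that $x_j \le y_j$ for at least one $j$.

Note that by Proposition 13 this implies that

\[ P_{\beta^*}(\hat\beta_L \not> 0) = P_{\beta^*}(\hat u_n - \hat u_{LS} + \hat u_{LS} \not> -n^{1/2}\beta^*) \le P(-K\iota + \hat u_{LS} \not> -n^{1/2}\beta^*) \le \varepsilon, \]

yielding

\[ \inf_{\beta \in O^{\iota}} P_\beta(\hat u_n \in M_n) \le P_{\beta^*}(\hat u_n \in M_n) \le P(\hat u_n^{\iota} \in M_n) + \varepsilon. \]

Since $\varepsilon > 0$ was arbitrary, this shows the desired inequality. □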

Essentially, we have now shown the main theorem for one part of the parameter space $\mathbb{R}^p$. By flipping
signs, we can apply Proposition 14 to each orthant $O^d$, thus obtaining the formula for the infimal coverage
over the whole space.

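Proof of Theorem 1. First note that

\[ \inf_{\beta \in \mathbb{R}^p} P_\beta(\hat u_n \in M_n) = \min_{d \in \{-1,1\}^p} \inf_{\beta \in O^d} P_\beta(\hat u_n \in M_n). \]

Thus, if we can show that

\[ \inf_{\beta \in O^d} P_\beta(\hat u_n \in M_n) = P(\hat u_n^d \in M_n) \]

for each $d \in \{-1,1\}^p$, the proof is done. Now, fix $d$ and set $D = \mathrm{diag}(d)$. We consider the function

\[ \tilde Q_n(u) = Q_n(Du) = u'DC_nDu - 2u'DW_n + 2n^{-1/2}\sum_{j=1}^p \lambda_{n,j}\left[ |d_ju_j + n^{1/2}\beta_j| - |n^{1/2}\beta_j| \right] \]
\[ = u'\tilde C_n u - 2u'\tilde W_n + 2n^{-1/2}\sum_{j=1}^p \lambda_{n,j}\left[ |u_j + n^{1/2}d_j\beta_j| - |n^{1/2}d_j\beta_j| \right], \]

where $\tilde C_n = DC_nD$ and $\tilde W_n = DW_n \sim N(0, \sigma^2\tilde C_n)$. We write $\tilde u_n$ for the minimizer of $\tilde Q_n$, and, analogously
to Section 3, we define $\tilde u_n^{\iota}$ to be the minimizer of the function $u'\tilde C_n u - 2u'\tilde W_n + 2n^{-1/2}\sum_{j=1}^p \lambda_{n,j}u_j$.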

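If we can show that the set $DM_n$ satisfies the requirement of Proposition 14 with the matrix $\tilde C_n$ in
place of $C_n$, we may conclude that

\[ \inf_{\beta : d_j\beta_j \ge 0} P_\beta(\tilde u_n \in DM_n) = P(\tilde u_n^{\iota} \in DM_n). \]

Note that $\tilde u_n = D\hat u_n$, $\tilde u_n^{\iota} = D\hat u_n^d$ and $D^{-1} = D$, so that

\[ \inf_{\beta \in O^d} P_\beta(\hat u_n \in M_n) = \inf_{\beta \in O^d} P_\beta(\tilde u_n \in DM_n) = P(\tilde u_n^{\iota} \in DM_n) = P(\hat u_n^d \in M_n), \]

which proves the formula for the infimal coverage probability. We now show that the set $DM_n$ satisfies
that

\[ \bigcap_{j=1}^p \left[ A^{1}_{\tilde C_n,j}(Dm) \cup B^{1}_{\tilde C_n,j}(Dm) \right] \subseteq DM_n \]

for all $m \in M_n$. A straightforward calculation shows that this is equivalent to

\[ \bigcap_{j=1}^p \left[ A^{d_j}_{C_n,j}(m) \cup B^{d_j}_{C_n,j}(m) \right] \subseteq M_n \]

for each $m \in M_n$, which clearly holds by Condition A and Lemma 12.

The distributional result on $\hat u_n^d$ immediately follows by (5). □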
7.2 Proofs for Section 4

Proof of Proposition^ Let $m \in E_{C_n}(k)$ and $y \in A^d_{C_n}(m)$, and abbreviate $C = C_n$. We show that $y \in E_{C_n}(k)$. Remember that
$D = \mathrm{diag}(d)$ satisfies $DD = I_p$. Since $y \in A^d_C(m)$, we have $-Dy \in O^{\iota}$ and $-DC(m - y) \in O^{\iota}$, implying
that

\[ y'C(m - y) = (Dy)'DC(m - y) \ge 0. \]

Furthermore, since $(m - y)'C(m - y) \ge 0$, we have

\[ m'C(m - y) \ge y'C(m - y) \ge 0, \]

which in turn yields

\[ m'Cm \ge m'Cy \ge y'Cy \ge 0. \]

But this means that $k \ge m'Cm \ge m'Cy \ge y'Cy$ and therefore $y \in E_{C_n}(k)$. □


Proof of Proposition 7. We transform the ellipse to a sphere and the corresponding normal distribution
to have independent components with equal variances:

\[ P(\hat u_n^d \in E_{C_n}(k)) = P(C_n^{1/2}\hat u_n^d \in C_n^{1/2}E_{C_n}(k)), \]

where $C_n^{1/2}\hat u_n^d \sim N(-n^{-1/2}C_n^{-1/2}\Lambda_n d, \sigma^2 I_p)$ and $C_n^{1/2}E_{C_n}(k) = \{z \in \mathbb{R}^p : \|z\|_2^2 \le k\}$. So clearly, the smallest
probability will be achieved for the distribution with mean furthest away from the origin, which is any
$d^*$ maximizing $\|C_n^{-1/2}\Lambda_n d\|_2$ over all $d \in \{-1,1\}^p$. □
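
For small $p$, a maximizing sign vector $d^*$ can be found by brute-force enumeration. The Python sketch below assumes $\Lambda_n = \mathrm{diag}(\lambda_{n,1},\dots,\lambda_{n,p})$ and uses the identity $\|C_n^{-1/2}\Lambda_n d\|_2^2 = (\Lambda_n d)' C_n^{-1} (\Lambda_n d)$, so that no matrix square root is needed; the factor $n^{-1/2}$ in the mean is omitted because it does not affect the maximizer. The helper names and the example matrix are hypothetical.

```python
from itertools import product


def solve(C, b):
    """Solve C x = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    A = [list(C[i]) + [b[i]] for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for k in range(col, p + 1):
                A[r][k] -= factor * A[col][k]
    x = [0.0] * p
    for i in reversed(range(p)):
        s = A[i][p] - sum(A[i][k] * x[k] for k in range(i + 1, p))
        x[i] = s / A[i][i]
    return x


def squared_mean_norm(C, lam, d):
    """Return (Lambda d)' C^{-1} (Lambda d) = ||C^{-1/2} Lambda d||_2^2."""
    v = [lam[j] * d[j] for j in range(len(d))]
    x = solve(C, v)
    return sum(v[j] * x[j] for j in range(len(d)))


def worst_sign_vector(C, lam):
    """Enumerate all 2^p sign vectors and return one maximizing the norm."""
    return max(product((-1, 1), repeat=len(lam)),
               key=lambda d: squared_mean_norm(C, lam, d))


# Hypothetical example with positively correlated regressors.
C_n = [[1.0, 0.5], [0.5, 1.0]]
lam_n = [1.0, 1.0]

d_star = worst_sign_vector(C_n, lam_n)
value = squared_mean_norm(C_n, lam_n, d_star)
print(d_star, value)
```

In this example the maximum is attained at sign vectors with opposite signs, since $C_n^{-1}$ has negative off-diagonal entries. The enumeration costs $2^p$ linear solves and is therefore only practical for moderate $p$.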


Proof of Proposition^ We start by showing that for any $m \in \mathbb{R}^p$, $d \in \{-1,1\}^p$, we have

\[ A^d_C(y) \subseteq A^d_C(m) \quad \text{for all } y \in A^d_C(m). \qquad (6) \]

Let $z \in A^d_C(y)$. Then $d_jz_j \le 0$ and $d_j(Cy)_j \le d_j(Cz)_j$ for all $j$. But since $y \in A^d_C(m)$, we also have
$d_j(Cm)_j \le d_j(Cy)_j$ for all $j$, so that $d_j(Cm)_j \le d_j(Cz)_j$ for all $j$ and therefore $z \in A^d_C(m)$, thus proving
(6). So clearly, the set

\[ \bigcup_{m \in M} \bigcup_{d \in \{-1,1\}^p} A^d_C(m) \]

satisfies Condition A. For each $m \in M$, choose $d \in \{-1,1\}^p$ in such a way that $d_j = 1$ if $m_j = 0$ and
$d_j = -\mathrm{sgn}(m_j)$ for $m_j \ne 0$. We then get $m \in A^d_C(m)$, implying that the set in the display above actually
contains $M$. □


7.3 Proofs for Section 5

Proof of Remark^ We show ([^. Note that Proposition 13 entails that 

\[ \hat\beta_L \in \hat\beta_{LS} + n^{-1/2}B_n, \]

where

\[ B_n = \{z \in \mathbb{R}^p : |(C_nz)_j| \le n^{-1/2}\lambda_{n,j} \text{ for } j = 1,\dots,p\}. \]

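Since $\lambda_n/n^{1/2}$ converges, we have $B_n \subseteq C_n^{-1}B_\delta$ with $B_\delta = \{x \in \mathbb{R}^p : \|x\|_\infty \le \delta\}$ for some $\delta > 0$. Since
$C_n^{-1} \to C^{-1}$, the set $\{C_n^{-1} : n \in \mathbb{N}\}$ is bounded in operator sup-norm by Banach-Steinhaus, so that the
set $B_n$ is uniformly bounded over $n$ in sup-norm by, say, $\gamma > 0$. We now fix a component $j$ and show
that $\liminf_{n\to\infty}\inf_{\beta \in \mathbb{R}^p} P_\beta(\hat\beta_{L,j} \ne 0) > 0$. To this end, define $R_j = \mathbb{R}^{j-1} \times \{0\} \times \mathbb{R}^{p-j}$. Let $(C_n^{-1})_{jj}$ and $(C^{-1})_{jj}$
be the positive $j$-th diagonal elements of $C_n^{-1}$ and $C^{-1}$, respectively. Observe that

\[ \inf_{\beta \in \mathbb{R}^p} P_\beta(\hat\beta_{L,j} \ne 0) \ge \inf_{\beta \in \mathbb{R}^p} P_\beta\left( (\hat\beta_{LS} + n^{-1/2}B_n) \cap R_j = \emptyset \right) \]
\[ \ge \inf_{\beta \in \mathbb{R}^p} P_\beta\left( n^{1/2}\hat\beta_{LS,j} + \gamma < 0 \text{ or } n^{1/2}\hat\beta_{LS,j} - \gamma > 0 \right) \]
\[ = 2\Phi\left( -\gamma / (\sigma (C_n^{-1})_{jj}^{1/2}) \right) > 0, \]

and the last expression converges to $2\Phi(-\gamma/(\sigma (C^{-1})_{jj}^{1/2})) > 0$ as $n \to \infty$. □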
In order to prove the asymptotic analogue of Theorem 1, we need an asymptotic version of Proposition 14, which is formulated
in the following.

Proposition 15. If $M \subseteq \mathbb{R}^p$ satisfies that

\[ \bigcap_{j=1}^p \left[ A^{1}_{C,j}(m) \cup B^{1}_{C,j}(m) \right] \subseteq M \]

for all $m \in M$, then

\[ \inf_{t \in \bar{O}^{\iota}} P_t(\hat u \in M) = P(\hat u^{\iota} \in M). \]


Proof of Proposition 15. The first part of the proof is completely analogous to the first part of the proof of
Proposition 14 after identifying $n^{1/2}\beta$ with $t$ and dropping the subscript $n$. To see the reverse inequality,
note that for $t^* = (\infty,\dots,\infty)' \in \bar{O}^{\iota}$ we actually have $Q = Q^{\iota}$, so that $\hat u = \hat u^{\iota}$ in this case, which already
yields that

\[ \inf_{t \in \bar{O}^{\iota}} P_t(\hat u \in M) \le P_{t^*}(\hat u \in M) = P(\hat u^{\iota} \in M). \]

□


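Proof of Theorem^ The proof again is completely analogous to the proof of Theorem 1 after identifying
$n^{1/2}\beta$ with $t$, dropping the subscript $n$ everywhere and using Proposition 15 instead of Proposition 14.
Also, replace $O^d$ by $\bar{O}^d$ and note that

\[ \tilde Q(u) = Q(Du) = u'DCDu - 2u'DW + 2\sum_{j=1}^p \lambda_j\left[ 1_{\{t_j \in \mathbb{R}\}}(|t_j + d_ju_j| - |t_j|) + 1_{\{|t_j| = \infty\}}\,\mathrm{sgn}(t_j)d_ju_j \right] \]
\[ = u'\tilde Cu - 2u'\tilde W + 2\sum_{j=1}^p \lambda_j\left[ 1_{\{d_jt_j \in \mathbb{R}\}}(|u_j + d_jt_j| - |d_jt_j|) + 1_{\{|d_jt_j| = \infty\}}\,\mathrm{sgn}(d_jt_j)u_j \right], \]

where $\tilde C = DCD$ and $\tilde W = DW$. □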


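Proof of Corollary^ Let $c = \liminf_{n\to\infty}\inf_{\beta \in \mathbb{R}^p} P_\beta(\beta \in \hat\beta_L - n^{-1/2}M)$. Then there exists a sequence
$\beta_n$ in $\mathbb{R}^p$ such that $P_{\beta_n}(\beta_n \in \hat\beta_L - n^{-1/2}M) \to c$. Assume that $n^{1/2}\beta_n \to t \in \bar{\mathbb{R}}^p$ (if the sequence does
not converge, pass to subsequences). Since

\[ P_{\beta_n}(n^{1/2}(\hat\beta_L - \beta_n) \in M) \to c = P_t(\hat u \in M) \]

as $n \to \infty$ in the notation of Section 5, the theorem proved above then yields $c \ge \min_{d \in \{-1,1\}^p} P(\hat u^d \in M) = 1 - \alpha$.
To see the reverse inequality, let $\beta_n = d \in \{-1,1\}^p$ and note that for this sequence, we have

\[ P_{\beta_n}(\beta_n \in \hat\beta_L - n^{-1/2}M) = P_{\beta_n}(n^{1/2}(\hat\beta_L - \beta_n) \in M) \to P_t(\hat u \in M) \]

as $n \to \infty$, where $t = (d_1\infty,\dots,d_p\infty)' \in \bar{\mathbb{R}}^p$. Note that for this choice of $t$, $P_t(\hat u \in M) = P(\hat u^d \in M)$.
Since $d \in \{-1,1\}^p$ was arbitrary, $c \le \min_{d \in \{-1,1\}^p} P(\hat u^d \in M) = 1 - \alpha$ follows. □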


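Proof of Proposition\^ Define the function $V_n(u) = n[L_n(\beta_n + \lambda_n^*u/n) - L_n(\beta_n)]/(\lambda_n^*)^2$ and note that
$V_n$ is minimized at $n(\hat\beta_L - \beta_n)/\lambda_n^*$. The function $V_n$ is then given by

\[ V_n(u) = u'\frac{X'X}{n}u - \frac{2}{\lambda_n^*}u'X'\varepsilon + 2\sum_{j=1}^p \frac{\lambda_{n,j}}{\lambda_n^*}\left[ \left| u_j + \frac{n\beta_{n,j}}{\lambda_n^*} \right| - \left| \frac{n\beta_{n,j}}{\lambda_n^*} \right| \right]. \]

Clearly $u'X'Xu/n \to u'Cu$ by assumption. Since $X'\varepsilon/\lambda_n^* = (n^{1/2}/\lambda_n^*)\,n^{-1/2}X'\varepsilon$ and $\lambda_n^*/n^{1/2} \to \infty$ as
well as $n^{-1/2}X'\varepsilon = O_p(1)$, the second term in the above display vanishes in probability. To treat the
third term, simply note that $\lambda_{n,j}/\lambda_n^* \to \lambda_{0,j} \in [0,1]$ and $n\beta_{n,j}/\lambda_n^* \to \zeta_j \in \bar{\mathbb{R}}$ by assumption. Piecing this
together yields

\[ V_n(u) \xrightarrow{p} u'Cu + 2\sum_{j=1}^p \lambda_{0,j}\left[ 1_{\{\zeta_j \in \mathbb{R}\}}(|u_j + \zeta_j| - |\zeta_j|) + 1_{\{|\zeta_j| = \infty\}}\,\mathrm{sgn}(\zeta_j)u_j \right] = V^{\zeta}(u). \]

Since $V_n$ and $V^{\zeta}$ are strictly convex and $V^{\zeta}$ is non-random, it follows by Geyer (1996) that also the
corresponding minimizers converge in probability to the minimizer of the limiting function. □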


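Proof of Proposition 10. The equality of the two sets given in the display of Proposition 10 is trivial.
We show that the set $\mathcal{M}$ as defined earlier is equal to the set on the left-hand side and start by proving
that $\mathcal{M}$ is contained in that set. Take any $m \in \mathcal{M}$; by definition, there exists a $\zeta \in \bar{\mathbb{R}}^p$ so that $m$ is
the minimizer of $V^{\zeta}$. We need to show that $|(Cm)_j| \le \lambda_{0,j}$ for all $j$. Assume that $|(Cm)_{j_0}| > \lambda_{0,j_0}$ for
some $1 \le j_0 \le p$. If $(Cm)_{j_0} > \lambda_{0,j_0}$, we consider the directional derivative of $V^{\zeta}$ at its minimizer $m$ in
the direction of $-e_{j_0}$ to get

\[ \frac{\partial V^{\zeta}(m)}{\partial(-e_{j_0})} = -2(Cm)_{j_0} + 2\lambda_{0,j_0}\left[ 1_{\{m_{j_0} + \zeta_{j_0} \le 0\}} - 1_{\{m_{j_0} + \zeta_{j_0} > 0\}} \right] \]
\[ \le -2(Cm)_{j_0} + 2\lambda_{0,j_0} < 0, \]

which is a contradiction to $m$ minimizing $V^{\zeta}$. If $(Cm)_{j_0} < -\lambda_{0,j_0}$, then consider the directional derivative
of $V^{\zeta}$ at $m$ in the direction of $e_{j_0}$ to arrive at

\[ \frac{\partial V^{\zeta}(m)}{\partial e_{j_0}} = 2(Cm)_{j_0} + 2\lambda_{0,j_0}\left[ 1_{\{m_{j_0} + \zeta_{j_0} \ge 0\}} - 1_{\{m_{j_0} + \zeta_{j_0} < 0\}} \right] \]
\[ \le 2(Cm)_{j_0} + 2\lambda_{0,j_0} < 0, \]

yielding a contradiction also.

To see the reverse set-inclusion, we need to show that for any $m \in \mathbb{R}^p$ satisfying $|(Cm)_j| \le \lambda_{0,j}$ for
all $j = 1,\dots,p$, there exists a $\zeta \in \bar{\mathbb{R}}^p$ such that $m$ is the minimizer of $V^{\zeta}$. Let $\zeta = -m \in \mathbb{R}^p$ and consider
the directional derivative of $V^{\zeta}$ at $m$ in any direction $r \in \mathbb{R}^p \setminus \{0\}$:

\[ \frac{\partial V^{\zeta}(m)}{\partial r} = 2r'Cm + 2\sum_{j=1}^p \lambda_{0,j}|r_j| \ge \sum_{j=1}^p \left[ -2|(Cm)_jr_j| + 2\lambda_{0,j}|r_j| \right] = 2\sum_{j=1}^p \left[ -|(Cm)_j| + \lambda_{0,j} \right]|r_j| \ge 0. \]

Since the directional derivative is non-negative in any direction $r \in \mathbb{R}^p \setminus \{0\}$ and $V^{\zeta}$ is (strictly) convex,
$m$ must be the minimizer. □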
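
The reverse inclusion can be illustrated numerically: for a point $m$ with $|(Cm)_j| \le \lambda_{0,j}$ and the choice $\zeta = -m$, the limiting objective should not decrease under any perturbation of $m$. The Python sketch below evaluates $V^{\zeta}(u) = u'Cu + 2\sum_j \lambda_{0,j}(|u_j + \zeta_j| - |\zeta_j|)$ for finite $\zeta$ at randomly perturbed points; the matrix, the vector $\lambda_0$ and the point $m$ are hypothetical values chosen only for illustration, and a random search is of course no substitute for the proof.

```python
import random


def V(u, C, lam0, zeta):
    """Limiting objective V^zeta(u) for finite zeta."""
    p = len(u)
    quad = sum(u[i] * C[i][j] * u[j] for i in range(p) for j in range(p))
    pen = sum(lam0[j] * (abs(u[j] + zeta[j]) - abs(zeta[j])) for j in range(p))
    return quad + 2.0 * pen


# Hypothetical example values.
C = [[1.0, 0.5], [0.5, 1.0]]
lam0 = [1.0, 1.0]
m = [0.5, 0.2]

# Check the defining constraint |(Cm)_j| <= lam0_j of the set.
Cm = [sum(C[j][k] * m[k] for k in range(2)) for j in range(2)]
in_set = all(abs(Cm[j]) <= lam0[j] for j in range(2))

# Choose zeta = -m as in the proof and compare V at m with V at perturbed points.
zeta = [-x for x in m]
base = V(m, C, lam0, zeta)
rng = random.Random(1)
never_better = all(
    V([m[j] + rng.uniform(-1.0, 1.0) for j in range(2)], C, lam0, zeta) >= base - 1e-12
    for _ in range(1000)
)
print(in_set, never_better)
```

The small tolerance in the comparison only guards against floating-point rounding; the proof shows that the inequality holds exactly.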


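Proof of Corollary m We start with the case $d > 1$. Let $c = \liminf_{n\to\infty}\inf_{\beta \in \mathbb{R}^p} P_\beta(\beta \in \hat\beta_L - d\lambda_n^*\mathcal{M}/n)$.
By definition, there exists a subsequence $n_k$ and elements $\beta_{n_k} \in \mathbb{R}^p$ such that

\[ P_{\beta_{n_k}}(\beta_{n_k} \in \hat\beta_L - d\lambda_{n_k}^*\mathcal{M}/n_k) = P_{\beta_{n_k}}(n_k(\hat\beta_L - \beta_{n_k})/\lambda_{n_k}^* \in d\mathcal{M}) \to c \]

as $k \to \infty$. Note that $d\mathcal{M} = \{m \in \mathbb{R}^p : |(Cm)_j| \le d\lambda_{0,j},\ 1 \le j \le p\}$. Now, pick a further subsequence
$n_{k_l}$ such that $n_{k_l}\beta_{n_{k_l}}/\lambda_{n_{k_l}}^*$ converges in $\bar{\mathbb{R}}^p$ to, say, $\zeta$. The convergence result established above then shows that $n_{k_l}(\hat\beta_L - \beta_{n_{k_l}})/\lambda_{n_{k_l}}^*$
converges in probability to the unique minimizer of $V^{\zeta}$ as $l \to \infty$. Finally, Proposition 10 implies that
$c = 1$.

We next look at the case where $d < 1$. Let $m = C^{-1}\lambda_0$ so that $m \in \mathcal{M} \setminus d\mathcal{M}$. From the proof of
Proposition 10 we know that for $\zeta = -m$ we have $m = \arg\min_{u \in \mathbb{R}^p} V^{\zeta}(u)$. Let $\beta_n = \lambda_n^*\zeta/n$. By
the same convergence result, $n(\hat\beta_L - \beta_n)/\lambda_n^*$ converges to $m$ in probability, so that $P_{\beta_n}(n(\hat\beta_L - \beta_n)/\lambda_n^* \in d\mathcal{M}) \to 0$. □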